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ABSTRACT 

Methods and systems are described for learning correlation rules used in event 
management. The key method consists of the steps of (a) marking one or more 
event groupings; (b) employing a machine learning program to learn the 
underlying concept of these groupings; (c) including a rule right-hand side; and 
(d) putting the new rule in the Rule DB. The system to implement this method 
consists of components for: (1) interactive visualization and user interface 
control; (2) query-based learning; (3) Event DB access; and (4) correlation Rule 
DB access. 
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FIELD OF THE INVENTION 



The present invention relates generally to network and systems management 
and more specifically to detecting and resolving availability and performance 
problems. 

BACKGROUND 

With the dramatic decline in the price of hardware and software, the cost of 
ownership for computing devices is increasingly dominated by network and 
systems management. Included here are tasks such as establishing 
configurations, help desk support, distributing software, and ensuring | the 
availability and performance of vital services. The latter is part.cularly important 
since inaccessible and/or slow services decrease revenues and degrade 
productivity. 

The first step in managing availability and performance is event 
management. Almost all computing devices have a capability whereby the onset 
of an exceptional condition results in the generation of a message so that 
potential problems are detected before they lead to widespread service 
degradation. Such exceptional conditions are referred to as events. Examples 
of events include: unreachable destinations, excessive CPU consumption and 
duplicate IP addresses. An event message contains multiple attributes 
especially: (a) the source of the event, (b) type of event, and (c) the time at which 
the event was generated. 

Event messages are sent to an event management system (EMS). An 
EMS has an adaptor that parses the event message and translate it into a 
normalized form (an adaptor). This normalized information is then placed into an 
Event DB Next, the normalized event is fed into a correlation engine that that 
determines actions to be taken. This determination is typically driven by 
correlation rules that are kept in a Rule DB. Examples of processing done by 
correlation rules includes: 

1 Elimination of duplicate messages. Duplicate is interpreted broadly here. 
For example, if multiple hosts on the same local area network generate a 
destination-unreachable message for the same destination, then the events 
contain the same information. 

2 Maintenance of operational state. State may be as simple as which devices 
are up and which are down. It may be more complex as well, especially for 
devices that have many intermediate states or special kinds of error conditions 
(e.g., printers) 
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3 Problem detection. A problem is present if one or more components of the 
system are not functioning properly. For example the controller in i a load 
balancing system may fail in a way so that new requests are a^,™? 6 *^ 
same back-end web server, a situation that can be tolerated at low loads but can 
fead to se*le degradation at high load. Providing early detection of such 
situations is important in order to ensure that problems do not lead to 
wide-spread service disruptions. 

4 Problem isolation. This involves determining the components that are 
causing the problem. For example, distributing a new release of an application 

5 has software errors can result in problems for all end-users connecting to 
servers with the updated application. Other examples of problem causes include, 
devtce Tailure, exceeding some internal limit (e.g., buffer capacity), and 
excessive resource demands. 

The correlation engine provides automation that is essential for delivering 
cost effective management of complex computing environments C^ent art 
provides three kinds of correlation. The first employs operational policies 
expressed f as rules (e.g., Milliken, 1986). Rules are if-then statements in which 
the if-part tests the values of attributes of individual event, and the then-part 
specmes actions to take. An example of such a rule is "If a hub generates an 
excessive number of interface-down events, then check if the software loaded on 
the hub is compatible with its hardware release." The industry experience has 
been that such rules are difficult to construct, especially if they mclude 
installation-specific information. 

Another approach has been developed by SMARTS (Yemini et a I.) based or. 
the concept of a codebook that matches a repertoire of known problems with 
event sequences observed during operation. Here, operational polices are 
models of problems and symptoms. Thus, accommodating new Problems 
-requires properly modeling their symptoms and incorporating their signatures into 
the code book. In theory this approach can accommodate ins all ^on-specific 
problems. However, doing so in practice is difficult because of the high level of 
sophistication required to encode installation-specific knowledge into rules. 

Recently a third approach to event correlation has been proposed 
(Computer Associates International, 1999). This approach trains a neural 
network to predict future occurrences of events based on the frequency of their 
occurrence in historical data. Typically, events are specrf.ed based on 
thresholds such as CPU utilization exceeding 90%. The policy execution system 
uses the neural network to determine the likelihood of one of the previously 
specified events occurring at some time in the future. While this technique can 
provide advanced knowledge of the occurrence of an event, it still requires 
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specifying the events themselves. At a minimum, such a specification requires 
detailing the following: 

1 . The variable measured (e.g., CPU utilization) 

2. The directional change that considered (e.g., too large) 

3. The threshold value (e.g., 90%) 

The last item can be obtained automatically from examining representative 
historical data. Further, graphical user interfaces can provide a means to input 
the information in items (2) and (3). However, it is often very drff.cu for 
installations to choose which variables should be measured and the directional 
change that constitutes an exceptional situation. 

To summarize, current art for event manage systems is of three types. The 
first (eg as in Milliken, 1986) requires that correlation rules be specified by 
experts a process that is time-consuming and expensive. The second (e.g., as 
in Yemini) reduces the involvement of experts but only for aspects of event 
management that share broad commonalties (e.g., IP connectivity). The third 
Tea Computer Associates International, 1999) attempts to automate the 
construction of correlation rules for a broader range of management areas. 
However to date, this has not been done in a manner that provides for 
customization by experts, especially in a way that avoids dealing with low-level 
details (e.g., specific threshold values, the choice of measurement values, and 
directional changes of interest for these variables). 

More broadly, no existing EMS provides decision support for constructing 
correlation rules, which we refer to as machine-aided rule construction. At 
best existing art provides authoring systems that aid in syntax checking and 
formatting. However, no assistance is provided for translating examples of event 
patterns (drawn from historical data) into correlation rules. 



SUMMARY OF THE INVENTION 



The present invention addresses the problem of decision support for 
constructing correlation rules for event management. The invention includes 
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systems and methods for visualizing event data and machine-based processing 
of these data to aid in rule construction. 

Providing decision support for rule construction requires capabilities for 
visualizing and describing patterns in terms of rule left-hand sides In the area of 
v sualization, our starting point is the work described .n Ma and HeNerste.n, 1 999. 
Also relevant here is Rogowitz, Rabenhorst, and Treinish, 1999 wh.ch describes 
a way to provide preferred visualizations. 

Our invention makes use of machine learning algorithms to describe patterns 
in terms of rules. The framework we adopt is learning concepts expressed as 
predicates on attributes (e.g., as in Mitchell, 1997). In essence ^ a concept is a 
where-clause as expressed in the structured query language (SQL). An example 

is 

All events originate from subnet 15.2.3 and the event rate exceeds .75 per 
second. 

Here the attribute subnet must have the value 1 5.2.3 and the total number of 
events divided by the time-span in seconds of the group must exceed .75. 

Learning concepts is greatly facilitated by using one or more abstraction 
hierarchy. An example of a two level hierarchy is: (a) the host itself (e.g., 
yahoo.com) and (b) its type (e.g., file server, web server, name server). A rn°'"® 
extensive hierarchy might be based on the kind of interactions (workload) being 
done such as: (i) the specific transaction, (ii) the user performing the transartion 
(iii) the user's department, and (iv) the division in which the user works. If both 
cases we have containment in that higher levels encompass lower ones In 
event 'management, there are often multiple hierarchies (e.g., time, configuration, 
workload, event type). 

A broad class of learning algorithms we consider (eg, 
generalization-specialization algorithms as in Mitchell, 1997) uses abstraction 
hierarchies in two ways. First, when a positive example is encountered that is not 
covered by the current set of predicates, the level of one or more abstraction 
hiearchies is increased to include this example. Second when a negative 
example is encountered that is covered by the predicate, the level of one or 
more abstraction hierarchies is decreased. Various schemes are used to 
optimize that hierarchy levels chosen to maximize the number of positive 
examples covered and minimize the number of negative examples covered.. 
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The main method of our invention consists of a series of . interactions 
between an analyst-a domain expert-and our system (hereafter referred to as 
thrmlchine) whereby correlation rules are constructed. This method consists of 
steps for (1) the analyst marking one or more event groupings; (2) the machine 
Teaming the left-hand side for event patterns; (3) the analyst adding the 
right-hand side; and (4) the machine placing the rule in the Rule DB that is used 
by the correlation engine of the EMS. Step (1) is greatly aided by having an 
effective tool for visualizing and interacting with event groups. Step (2) employs 
machine-learning techniques, especially query-based learning , and 
Generalization-specialization hierarchies that allow a machine to choose the best 
level of an abstraction hierarchy to cover positive examples of an event pattern 
(and avoid negative examples). 

To elaborate, in our invention learning a left-hand side means determining 
the predicates necessary to describe a set of event groupings. Predicates 
consist of logical statements about attribute values^ For exarnp^ it may be that 
event groups are characterized originating from hubs, on subnet 9.2.16, wrth an 
event rate of .5/second. Then, we want a learning algorithm to determine these 
predicates. 

The foregoing method may be augmented by incorporating data mining 
algorithms to aid in finding patterns of interest. These patterns can, in essence, 
seed the process of finding left-hand sides of rules. 

The system of our invention includes components for interactive 
visualization, learning event patterns, rule construction, event data access ; and 
Rule DB access. The visualization system in conjunction with event data access 
provide a means for analysts to select event groupings that are then translated 
into left-hand sides by the pattern learner. Rule construction in combination wrth 
Rule DB access provide a means for adding the rule right-hand side and placing 
the result in the Rule DB. 

Considerable benefits accrue from our invention. First, the construction of 
correlation rules is made easier in that left-hand sides of rules can be generated 
automatically. Clearly, this is a productivity benefit in that expressing left-hand 
sides can be time consuming. In addition, this capability can allow those who are 
knowledgeable about operations to develop rules even though they may not be 
trained in constructing correlation rules. 

Another benefit of the present invention relates to the assessment of 
correlation rules once they are constructed. In existing art - rule ^^ e i e ,^^ e ^ b t y 
using them in production systems. While in some sense th.s .s the ultimate tes^ 
rt may be some time before a situation arise when the rule » invoked. A 
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complementary approach is to apply the rule's left-hand side to historical date 
selecting instances of the patterns specified by the rule. By so doing the 
operations staff can determine if the situations for which the rule is intended are 
in fact those that will be selected in production. 



BRIEF DESCRIPTION OF DRAWINGS 

Fig. 1 shows the overall architecture in which our invention operates. 

Fig. 2 displays a visualization used to identify groupings of events when 
learning the left-hand side of rules. 

Fig. 3 shows the process for machine-aided rule construction. 

Fig. 4 describes the process for query-based learning of a rule left-hand 
side. 

Fig 5 displays some hierarchies used in employing the 
generalization-specialization algorithm to learn rule left-hand sides. 

Fig. 6 illustrates the system for machine-aided rule construction. 

Fig. 7 shows the system in our invention for pattern learning 

Fig. 8 displays a user interface for inputting the information in box 500 in Fig. 

4. 

Fig. 9 displays a user interface for presenting the results of box 520 in Fig. 4. 

Fig. 10 displays a user interface for accomplishing box 530 of Fig. 4. 

EMBODIMENT 

Fia 1 displays the overall architecture of the environment in which our 
invention operates. An operator (100) receives alerts and initiates responding 
actions based on interactions with the event management system (110) that 
receives events (170) from various devices, such as file servers (140) name 
servers (150), and mail servers (160). The event management system updates 
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the Event DB (180) with newly received events and reads this DB to do event 
based on the Rule DB (185). An analyst (120) uses the event 
manaqement decision support system (130) of our mvent.on to develop he 
SSSon rules used by the event management system to control the 
int^actions with the operator. Doing so requires reading histor.cal event data .n 
the Event DB and writing the Rule DB. 

Fig 2 illustrates the kind of display used by the event ma ™f m * n \ d *f™] 
support system to identify patterns that should be translated into the left-hand 
side rules. Many patterns may be present, including per.od.ct.es and event 
bursts. Also, patterns may exist at multiple time-scales. 

Fiq 3 displays the process for machine-aided rule construction in our 
invention Boxes beginning with an "A" are performed by an analyst human), 
hole that begin with an "M" are done by the machine; and those with A,M are 
done coNaboratively by the analyst and the machine. In 410, the analyst and the 
mTcNnetllaboraL to learn the left-hand side of rules , based on pat erns 
identified in visualizations such as those displayed in Fig. 2. In 420, the analyst 
augments the left-hand side with a right-side action. In 430, tr^-nepta£ 
the new rule in the Rule DB of the event management system (which is box 330 

in Fig. 6). 

Fig 4 provides the details of step 410 in Fig. 3. In 500, the analyst marks 
one or more event groupings and indicates if they are positive or negative 
examples of the problem to be detected in the rule's left-hand side. In 505 the 
machine learns a concept using the positive and negat.ve examples provided by 
SS^vstTn 5lS the machine determines if there are a sufficient number of 
examples to learn the rule left-hand side. If there are, the flow proceeds to step 
420 If there is not, the machine looks for similar patterns based on the rule 
constructed so far. In 530, the analyst critiques the result by determining if the 
examples to date accurately reflect the concept to be identified. Th.s may 
involve: (a) reclassifying a positive example as a negative example or a negat. ve 
example as a positive example; (b) deleting examples; and (c) '"eluding or 
excluding events in an example so that it better conforms with the concept being 
teamed Ir ^540 the analyst may optionally adjust the parameters of the learner 
lo better operate with the" concept being learned. Then, the flow continues with 
box 505. 

To elaborate on step 520, consider the preliminary concept "there is a 
port-down event followed by a port-up event from the same host .n w.th.n 5 
seconds" The machine seeks other examples of such an event sequence from a 
sfnde host One way this can be done is for the machine to do a SQL query that 
^eves all event interface-down events. Then for each, the machine also 
retrieves the events that occurred over the next five seconds from that same 
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host The machine then checks if one of these events is an interface-up. For 
mose hosts that this is the case, the machine then reports the ent.re sequence of 
events from interface-down through interface-up. 

Fig. 5 displays examples of the hierarchies used to learn concepts for the 
left-hand side of rules. Four hierarchies are shown. In 600, there is a t me 
hierarchy consisting of work shift (610), hour of the day (625), and minute within 
Sie tou (630). In 605, there is a configuration element hierarchy ^ons.st.ng o 
the tvpe ofhost (635) e.g., mail server, file server) and the host identifier (640) 
e g its Internet address) in 610, there is a workload hierarchy cons.st.ng of the 
S^ion n which the user is employed (645), the department (650) and user s 
ntmT (655) and the transaction (or work unit) being performed by the user 
(660) n 615 there is a hierarchy of event types consisting of the situation in 
which the event occurs (e.g., printer failure) (665), the nature of the action (670), 
and the specifics of the element itself (675). 

Fig 6 displays the components of the event management decision support 
^tem M30) The interactive visualization and user interface control (700) 
pSsove^ll control of the interactions with the analyst (120) and the jf bw 
win the event management decision support system^ The pat terr .lea ner (710 
\s invoked to perform step 410 described in Fig. 3, and the rule constructor (730) 
ZatorTon with the'Rule DB access component (725) are used to perform 
step 430 in Fig. 3, which involves reading and writing the Rule DB. ' he event 
mfner (705) and event data access (720) components are used in combraton to 
aid in visualization and similarity query functionary used in step 520 of Fig. 4, 
which requires reading the Event DB. 

Fig. 7 details the elements of the pattern learner (710) in Fig. .6. The .event 
visualization and control component (800) controls interact.ons with the ana yst 
for purposes of learning event patterns. 800 also controls the flow within the 
pa7eXrne° including 9 queries^ the Event DB via ^^ ( ^^ 
(805) which in turn invokes the Event Data Access component (810) to read 
from the Event DB. In addition, 800 invokes the Pattern Inference component 
815) to determine possible patterns in the set of positive and negate 
examples and establishes hierarchies used by the Hierarchy Manipulator (825) 
^s wtoyed by 815. Event Visualization and Control also updates the set of 
X aRs examples (820) and invokes the Similarity Query Engine 
Sato aid in finding other positive and negative examples. Doing so requires 
specifv^rnumerical distances between patterns, which 800 specf.es through 
wTthe Distance Calculator (835), a component that ,s .nvoked by 
the Similarity Query Engine. 

Fig 8 9 and 10 illustrate how analysts interact with the system in our 
invention to construct correlation rules. Fig. 8 illustrates a user .nterface that can 
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be employed to achieve box 500 in Fig. 4. Seen here is a scatter plot. The i x-ax s 
is time and the y-axis is the host from which an event originated. The latter is 
categorical and so can be ordered in any manner that is convenient for 
visualizing patterns. The user has selected a set of events between times and 
9 and for hosts C, B, and A. This is an example of an underlying concept that the 
analyst knows but cannot articulate in a computer encoded manner. For the 
purposes of this example, the true concept contains all events that are on subnet 
15.2.3 and for which the rate of the event set (count divided by time) exceeds 
.5/second. 

Fiq 9 displays the results of steps 505, 510, and 520 in Fig. 4. Here, the 
machine has inferred from the example of Fig. 8 the concep t that even ts 
ordinate from a Hub and that the rate exceeds .75/second. We see two new 
patterns that have been found by the machine that are consistent with this 
concept. 

Fiq 10 shows how the analyst accomplishes step 530 in Fig. 4. The analyst 
has edited the patterns found by the machine to be consistent with the analyst s 
concept. Specifically, in one case an event is excluded; in the other, a previously 
excluded event is included. 



CLAIMS 



1 System for containing an event visualization, data access component, event 
group selection and editing, a similarity query, and a concept learner component 
to provides for learning situations. 

2 Method for the system in claim (1) for learning situations that includes the 
steps of marking event groups as positive and negative examples, (2) inferring 
concepts from the examples, and (3) translating this into the left-hand side of a 
rule. 

3 A method as in claim 2 in which the learning step includes: (a) learning an 
initial concept, (b) determining if this is statistically significant given the data (c) 
querying historical event data for similar event groupings, and (d) allowing the 
user to critique the result. 

4. Concept hierarchy used by the method in claim (2) that includes hierarchies 
for hosts and event types. 
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